Modeling health and well-being measures using ZIP code spatial neighborhood patterns

Individual-level assessment of health and well-being permits analysis of community well-being and health risk evaluations across several dimensions of health. It also enables comparison and rankings of reported health and well-being for large geographical areas such as states, metropolitan areas, and counties. However, there is large variation in reported well-being within such large spatial units underscoring the importance of analyzing well-being at more granular levels, such as ZIP codes. In this paper, we address this problem by modeling well-being data to generate ZIP code tabulation area (ZCTA)-level rankings through spatially informed statistical modeling. We build regression models for individual-level overall well-being index and scores from five subscales (Physical, Financial, Social, Community, Purpose) using individual-level demographic characteristics as predictors while including a ZCTA-level spatial effect. The ZCTA neighborhood information is incorporated by using a graph Laplacian matrix; this enables estimation of the effect of a ZCTA on well-being using individual-level data from that ZCTA as well as by borrowing information from neighboring ZCTAs. We deploy our model on well-being data for the U.S. states of Massachusetts and Georgia. We find that our model can capture the effects of demographic features while also offering spatial effect estimates for all ZCTAs, including ones with no observations, under certain conditions. These spatial effect estimates provide community health and well-being rankings of ZCTAs, and our method can be deployed more generally to model other outcomes that are spatially dependent as well as data from other states or groups of states.


Data
Here we present exploratory data analysis of the overall Well-Being Index (Overall WBI) and well-being across the five subscales (Physical, Financial, Social, Community, and Purpose) for ZCTAs across Massachusetts and Georgia.We find that ZCTAs around Boston as well as some in Western Massachusetts and Cape Code have the highest average scores for overall WBI and the five subscales (Supplementary Figures S1(a) -S1(f)).Generally, we find lower average scores in ZCTAs in southeastern Massachusetts and also west of Worcester, especially in the Financial subscale.Note that there are several ZCTAs with zero observations in western Massachusetts.In Georgia, we find ZCTAs in northern Georgia have the highest average scores.ZCTAs south of Atlanta and also southeastern Georgia having the lowest scores across overall WBI and most of the subscales, especially the financial, community, and purpose subscales (Supplementary Figures S2(a) -S2(f)).
Supplementary Table S1 shows more comprehensive statistics for average WBI and subscales across ZCTAs in the US, Massachusetts, and Georgia.We can see that Georgia generally performs reasonably well (based solely on the available well-being scores) across all subscales, with averages higher than both Massachusetts and the United States as a whole.Additionally, we also notice that the standard deviation for Georgia is lower that that of the US and Massachusetts, illustration a more homogeneous distribution across ZCTAs.

Methods
Construction of the graph Laplacian matrix.Note that the regularization term in our model involves a graph Laplacian matrix defined as L = D − A ∈ R S×S , where A is the (weighted) adjacency matrix and D is the corresponding degree matrix.The adjacency matrix weights are calculated as A ss ′ = exp −δ 2 ss ′ 2ω 2 , where δ ss ′ is the driving time between the population centroids of ZCTA s and s ′ , and ω is the driving time cutoff (30-minutes in our case).Thus, ZCTAs whose population centroids are more than ω minutes away are not considered neighbors and thus their weight is equal to zero.Additionally, the degree matrix is Supplementary Figure S1.Average values of overall WBI and the five subscales by ZCTA for Massachusetts.Gray ZCTAs are ones with zero observations.All maps of Massachusetts and Georgia in the Supplementary Materials were created using the tmap and tmaptools R packages. 1 diagonal and is defined as D ss = ∑ S s ′ =1 A ss ′ and D ss ′ = 0, which corresponds to the out-degree degree matrix.Diagonal entries in the out-degree degree matrix represent the total driving time to travel to all other ZCTAs that are defined as neighbors, which makes sense in this analysis since we are interested in how individuals residing in a particular ZCTA travel to neighboring ZCTAs to access resources and amenities.
Estimation of standard errors.Recall that in ordinary least squares (excluding the spatial regularization term), βOLS = ( XT X) −1 XT y and var( βOLS ) = σ 2 ( XT X) −1 .Since our model is simultaneously estimating β and α, let θ = β α .We have Therefore, the variance of the estimates for our model with spatial regularization is given as Standard errors are then calculated by taking the square root of the diagonal entries of the above variance-covariance matrix.
Non-contiguous ZCTAs ZCTAs can be non-contiguous 2 and to evaluate the effect of non-contiguous ZCTAs on driving times, we calculated the number of non-contiguous ZCTAs in Massachusetts and Georgia and the maximum distances between the non-contiguous parts of each ZCTA (Table S2).Overall, there are only 8 ZCTAs out of 539 (1.5%) in Massachusetts and 27 ZCTAs out of 751 (3.6%) in Georgia that are non-contiguous, and among those that are non-contiguous, 80% are separated by less than 1 kilometer.The farthest distance between any non-contiguous ZCTA parts in both states is just over 4 kilometers.
In the context of our spatial relationship definition of a 30-minute driving time (and 60 minutes in the sensitivity analysis), the error introduced by non-contiguity of ZCTAs, if any, is very likely negligible.Moreover, we used population centroids to represent the starting/ending points of the driving distances irrespective of ZCTA contiguity; our results would be the same if the non-contiguous ZCTAs were connected with an unpopulated polygon.

Simulation
We perform simulations to show that if spatial patterns exist at the ZIP Code level, then our model is able to successfully capture those spatial patterns under various sample size and noise settings.In this section, we evaluate the model's performance using a simulation study under different scenarios with Massachusetts as the reference state.Specifically, we will assess the performance of our model in terms of estimation, model fit, and prediction accuracy.In this section, we describe the overall simulation setup, construction of the design matrix, and the choice of ZIP Code spatial patterns.Additionally, note that for our simulations, we used ZIP Codes as the spatial unit instead of ZCTAs.

3/10
True Zip− code effect Supplementary Figure S3.Choropleth maps of the true ZIP Code spatial effects for each of the four spatial patterns we consider in the simulation study.
We assign these N observations to different ZIP Codes as follows.We obtain N samples (with replacement) of ZIP Codes from the MA WBI dataset in order to mimic the representation of MA ZIP Codes.Massachusetts has 536 ZIP Codes of which only 447 are included in the MA WBI dataset.That is, 89 ZIP Codes did not have any responses, hence, these ZIP Codes will never be sampled.Note that, since we are sampling ZIP Codes with replacement from the original MA data set, certain ZIP Codes will have low probability of being sampled in the simulated dataset, which could result in such ZIP Codes not being sampled due to random chance.Additionally, in this section, we use the Euclidean distance to inform the spatial neighborhoods with ω = 25 miles as the cutoff.Although we fix ω = 25 miles for the purposes of this simulation, ω can be assigned to examine how different distance thresholds effect estimation of regression coefficients and ZIP Code-level spatial effects.Note that we use driving time to inform spatial neighborhoods in the manuscript for real data analysis.Both approaches to define ZIP Code neighborhood (distance-based and driving time-based) can be used in practice.For the purpose of this simulation study, as a proof of concept, we used distance between ZIP Codes for the same, but it can readily be deployed with driving time-based neighborhood as well.
The ZIP Code-level spatial deviations αs are assigned under four different spatial patterns 3 : block, smooth, hotspots (Boston and Springfield), and random.Supplementary Figures S3(a The spatial effects for ZIP Codes are determined by the block they belongs to and are assigned as −3, −1, 1 and 3 for Blocks 1 to 4, respectively.In the smooth pattern, the spatial deviations steadily increase from αs = −3 to αs = 3 as we move from West to East, with Cape Cod, Martha's Vineyard, and Nantucket having the highest spatial deviation.In the hotspots pattern, we choose the cities of Boston and Springfield as hotspots.For the purpose of this simulation, we assign a positive spatial effect to Boston and negative spatial effect to Springfield.Additionally, the spatial effect of the ZIP Codes as we move further from Boston and Springfield, have a decreasing influence of the hotspots.Lastly, we consider a random spatial pattern in which spatial deviations are randomly sampled from a uniform distribution U {−3, 3}. We replicate the simulation 100 times for each combination of (N, σ ).However, for brevity, we will present results from the median simulation setting with (N = 5000, σ = 10).This setting is similar to the MA WBI data for the overall WBI scale in terms of N and sample size and standard deviationσ .

4/10
Zip−code effect Supplementary Figure S4.Choropleth of estimated ZIP Code spatial effects from one replication under the (N = 5000, σ = 10) simulation setting for each of the four different spatial patterns.

Model estimation
Overall, we find fairly accurate estimation of β coefficients.Most of the coefficient estimates are within 2-3% of the true value and estimation is consistent over all spatial patterns, as well as sample size and noise settings.Supplementary Figures S4(a) -S4(d) illustrates the ZIP Code-level spatial effect estimates from one replications of the (N = 5000, σ = 10) simulation setting.For both the block and smooth spatial patterns, we observe the spatial effects increasing as we move from West to East, with the smooth pattern depicting a slightly more continuous increase.Additionally, for the hotspots pattern, we find that the lowest spatial effects are concentrated near Springfield and the highest spatial effects are concentrated near Boston.Lastly, for the random spatial pattern, we find no discernible pattern in the spatial distribution of αs .Thus, for each spatial structure, the general pattern of estimated spatial effects mimics the structure of true spatial effects.Additionally, Supplementary Table S3 shows the Spearman correlation between αs and αs for each simulation setting and spatial pattern.As expected, we observe that the correlation increases as sample size increases and noise decreases.We find that the smooth pattern has the highest correlation (random pattern has the lowest correlation) since the spatial smoothness assumption is aligned with the assumptions of the model.
Although the Spearman correlations between true and estimated spatial effects are high, there are some ZIP Codes with large discrepancies, especially in the block and hotspots patterns.Supplementary Figures S5(a) -S5(d) depict the spatial effect residuals ( αs − αs ) for ZIP Codes from one replication under the (N = 5000, σ = 10) simulation setting in each of the four spatial patterns.In the block structure, we notice that the magnitude of the spatial effect residuals are highest at ZIP Codes near the border of the blocks.This is expected as true spatial effects sharply increase by two units as the blocks move from West to East.When observing the map of spatial effect residuals in the smooth and hotspot structures, we don't find any evident spatial pattern.The only major feature we notice is that ZIP Codes with the highest true spatial effect in terms of absolute value tend to have larger spatial effect residuals and this is due to attenuation of spatial effects from the model.Lastly, we find no discernible pattern in the distribution of residuals for the random structure.
One of the advantages of our model is that it returns estimates of spatial effects for ZIP Codes no responses due to the spatial neighborhood based penalty.Supplementary Figure S6 illustrates the median absolute value and interdecile range of spatial effect residuals for ZIP Codes that were missing and present in the data.For every sample size, noise, and pattern setting, we find that the magnitude of spatial effect residuals is larger for ZIP Codes with zero responses -but still reasonable given that those ZIP Codes do not have any observations.for each of Massachusetts's 536 ZIP Codes.Thus, each of the three covariates was randomly generated as described previously and the outcome variable y was simulated using the same set of true β coefficients and a spatial effect α s that depends on the ZIP Code of the observation.Supplementary Table S5 shows the average RMSE of the test dataset.We can see that the average RMSE is roughly equal for each spatial pattern and also approximately equal to the noise.This means that the average difference between the predicted outcome and the true outcome is about one standard deviation.
Another metric we consider for prediction is the correlation between the true values in the test set and the predicted values.As we can see, the average correlation is highest for the low noise (σ = 5) setting with correlations around 0.9 for all spatial patterns and sample settings (Supplementary Table S6).In the medium (σ = 10) and high (σ = 20) noise settings, the correlation is around 0.72-0.73and 0.46-0.49for all spatial patterns and sample size settings, respectively.

Adjusted R 2 for real data analysis
We consider the adjusted R 2 for models with regularization 4 .The formula for the adjusted R 2 is: , ] and is the effective degrees of freedom.We notice that for Massachusetts, the adjusted R 2 values are higher than that of Georgia for each subscale (Supplementary Table S7).Additionally, for both Massachusetts and Georgia, the financial subscale has the highest adjusted R 2 at 0.261 and 0.154, respectively and the lowest adjusted R 2 is in the purpose subscale, with values of 0.078 and 0.063, respectively.

Sensitivity analysis with 60-minute driving time threshold
In the data analysis for Massachusetts and Georgia presented in the article, we used a 30-minute driving time cutoff between ZCTA population centroids to inform our neighborhood structure.The driving-time cutoff however can be parameterized and different driving times may be more appropriate for certain geographies.Here, we perform a sensitivity analysis using a 60-minute driving time threshold for the state of Georgia.Firstly, we observe minimal differences in the coefficient estimates for most of our predictor variables, with noticeable differences in the magnitude observed only for Education and Urban (Table S8).Additionally, we can see that the overall spatial distribution of ZCTA effects is fairly similar between the 30-minute and 60-minute cutoffs (Supplementary Figures S7(a)-S7(b)).The top ranked ZCTAs are still around northeast Georgia and we still find many of the lowest ranked ZCTAs south of Atlanta and in the southwestern region of the state.The magnitude of the estimated ZCTA effects for the two driving time cutoffs are slightly different (as expected), that is, the estimated ZCTA effects with the 60-minute cutoff are generally smaller in magnitude compared to the 30-minute cutoff.This is expected since each ZCTA has more neighbors influencing its effect estimate with a 60-minute cutoff leading to a higher degree of spatial smoothing.
Although the ZCTA effects are estimated from smaller numbers of observations than the beta values and show some differences, the ZCTA quintile rankings are mostly consistent.Supplementary Table S9 shows the proportion of ZCTAs that belong to each quintile under a 30-minute and 60-minute driving time cutoff.Of ZCTAs ranked in the first quintile when using a 30-minute driving time cutoff, about 65% of them remain in the top quintile with a 60-minute driving time cutoff, and about 88% stay in the top two quintiles.Similarly, out of ZCTAs in the bottom quintile with a 30-minute driving time cutoff, 74% remain in the bottom quintile with a 60-minute driving time cutoff.In the middle quintiles, many of the ZCTA effect estimates are close to zero and small changes in these estimates may lead to a change in quintiles.We recognize that there are several aspects of the data and geography that affect ZCTA effect estimation, and careful consideration of an appropriate driving time threshold is required.
Average values of overall WBI and the five subscales by ZCTA for Georgia.Gray ZCTAs are ones with zero observations.
) -S3(d) illustrate these four spatial patterns.For the block structure, Massachusetts ZIP Codes are categorized into one of four vertical blocks determined by the longitudes of the ZIP Code centroids.The only exception is that the ZIP Codes north of Boston with a longitude greater than -70.84 are placed in Block 3 instead of Block 4 for spatial continuity of Block 4. Of the 536 ZIP Codes in MA, 81 are in Block 1, 107 are in Block 2, 271 are in Block 3, and 77 are in Block 4.